L01: Introduction
  Course overview, inc. some motivation.
  Programming vs. AI vs ML definitions, use example of a car:
    programming = engine management
    AI = GPS
    ML = Automatic driving (signpost recognition).
  The classic steps:
    Get Data
    Select Model
    Fit
    Verify
  The classic problem:
    Classification
    Supervised
    Learn once
  Introduce zoo animal data set
  Categorical distribution, inc. fitting.
    Summary: the best single guess is always the most common class.
    Summarising for different buckets.
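  A minimal sketch of fitting a categorical distribution by counting, then always guessing the most common class (toy labels, not the real zoo data):

```python
from collections import Counter

def fit_categorical(labels):
    """Fit a categorical distribution by normalised counts."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def predict_most_common(dist):
    """Best single guess under 0-1 loss: the most probable class."""
    return max(dist, key=dist.get)

# Toy stand-in for the zoo labels.
labels = ["mammal", "mammal", "bird", "fish", "mammal", "bird"]
dist = fit_categorical(labels)
guess = predict_most_common(dist)
```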
  Problems:
    Insufficient data for all bins - some are empty. ML is about regularisation.
    Demo a data set mismatch.
  Naive Bayes.
    Include -log(P(x)) 'cost' trick for numerical stability.
  Assumes independence - will fix next lecture.
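  A minimal Naive Bayes sketch over made-up binary features, using the summed -log(P) 'cost' trick; the Laplace smoothing constant alpha is an added assumption, there to dodge the empty-bin problem above:

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(X, y, alpha=1.0):
    """Per-class, per-feature value counts with Laplace smoothing."""
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for features, label in zip(X, y):
        for i, v in enumerate(features):
            feat_counts[(label, i)][v] += 1
    return class_counts, feat_counts, alpha, len(X)

def neg_log_posterior(model, features, label):
    """Cost = -log P(class) - sum_i log P(feature_i | class)."""
    class_counts, feat_counts, alpha, n = model
    cost = -math.log(class_counts[label] / n)
    for i, v in enumerate(features):
        c = feat_counts[(label, i)]
        num = c[v] + alpha
        den = sum(c.values()) + alpha * 2  # assumes binary features
        cost -= math.log(num / den)
    return cost

def predict(model, features):
    """Lowest cost wins - equivalent to highest probability."""
    class_counts = model[0]
    return min(class_counts, key=lambda lab: neg_log_posterior(model, features, lab))

# Toy binary features: (has_fur, lays_eggs).
X = [(1, 0), (1, 0), (0, 1), (0, 1)]
y = ["mammal", "mammal", "bird", "bird"]
model = fit_naive_bayes(X, y)
```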


L02: Decision Trees
  Decision trees - avoid the independence assumption. Discrete features only.
  Get them to construct some manually.
  Entropy and information gain.
  Brute force discovery of best split.
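  The entropy / information-gain split search can be sketched like this (toy rows, brute force over every discrete feature):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy drop from splitting on one discrete feature."""
    n = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[feature], []).append(lab)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Brute force: try every feature, keep the highest gain."""
    n_features = len(rows[0])
    return max(range(n_features), key=lambda f: information_gain(rows, labels, f))

# Toy rows: (has_feathers, big). Feature 0 separates the classes perfectly.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["bird", "bird", "mammal", "mammal"]
```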
  Demo for zoo animals.
    Discuss train/test split (later lecture for detail)
    Introduce hyper-parameters, and manually trying several.


L03: Regression Forests
  Variable kinds:
    Discrete
    Real - inc. with limits.
    Directional (mention vector representation)
    Other types, but not really going to discuss, e.g. matrices, graphs, text. Can convert into above.
  Decision trees with real features.
  Demo with wine quality data set.
  Bagging with bootstrap draws.
  Random forests.
  Extra randomness - don't need to brute force every option.
  Out of bag error.
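  A sketch of bagging with bootstrap draws plus out-of-bag error, using a deliberately trivial base model (predict the training mean) so the mechanics stand alone:

```python
import random

def bootstrap_indices(n, rng):
    """One bootstrap draw: n indices sampled with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def bag_of_means(ys, n_models=200, seed=0):
    """Bag a trivial base model (predict the training mean) and
    estimate out-of-bag squared error alongside."""
    rng = random.Random(seed)
    n = len(ys)
    model_means = []
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_models):
        idx = bootstrap_indices(n, rng)
        mean = sum(ys[i] for i in idx) / n
        model_means.append(mean)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # this sample is out of bag for this model
                oob_preds[i].append(mean)
    ensemble = sum(model_means) / n_models
    held_out = [i for i in range(n) if oob_preds[i]]
    oob_error = sum(
        (ys[i] - sum(oob_preds[i]) / len(oob_preds[i])) ** 2 for i in held_out
    ) / len(held_out)
    return ensemble, oob_error

ys = [2.0, 4.0, 6.0, 8.0]
ensemble, oob_error = bag_of_means(ys)
```

  The OOB error comes out pessimistic here because the toy base model is so weak - a useful talking point.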
  Classification vs regression.
  Regression trees/forests.


L04: *-supervised
  Supervised vs unsupervised vs semi-supervised learning.
  Mention there are other kinds, e.g. weakly supervised, reinforcement learning with delayed reward.
  Clustering.
  K-means (unsupervised)
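  A bare-bones k-means sketch on made-up 1D data (random initial centres, then assign-to-nearest / re-average until done):

```python
import random

def kmeans(points, k, n_iters=20, seed=0):
    """Plain k-means on 1D points: assign to nearest centre, re-average."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        # Keep a centre in place if its cluster ever empties out.
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)

# Two obvious clumps, around 1.0 and 10.0.
points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centres = kmeans(points, 2)
```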
  n-nearest neighbour graph algorithms.
  Random walker (semi-supervised) - Leo Grady's approach? Needs zombies...


L05: Optimisation for Pirates
  'In practice optimisation is everywhere' - use examples from previous lectures.
  Introduce univariate and multivariate regression.
  Idea of a cost function, some examples, other terms such as loss function.
  Analytic answer with x^2 cost.
  Analytic vs. optimisation.
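  The analytic answer with x^2 cost, for a univariate line fit (standard closed form, toy points):

```python
def fit_line_least_squares(xs, ys):
    """Closed-form minimiser of sum((a*x + b - y)^2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
a, b = fit_line_least_squares(xs, ys)
```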
  Lots of cost functions for regression, with why you might use each.
  Basic optimisation approaches:
    Brute force - hah! But sometimes...
    Random values.
    Genetic algorithms as 'slightly better than random'. Meh.
    Grid search - mention that random is usually better.
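  Grid search vs. random search, on a made-up 1D cost function with its minimum at 0.7:

```python
import random

def cost(x):
    """Toy cost with minimum at x = 0.7."""
    return (x - 0.7) ** 2

def grid_search(f, lo, hi, n):
    """Evaluate f on n evenly spaced points, keep the best."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=f)

def random_search(f, lo, hi, n, seed=0):
    """Evaluate f on n uniform random points, keep the best."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return min(xs, key=f)

best_grid = grid_search(cost, 0.0, 1.0, 11)
best_rand = random_search(cost, 0.0, 1.0, 11)
```

  With the same budget both get close here; random's advantage shows up in higher dimensions, where a grid wastes evaluations on redundant coordinates.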
  Demo combinations of cost/optimisation approach.
  Introduce over/under determined equations, degrees of freedom.
  Line fitting when you have the right number of points.
  RANSAC, as combination analytic/optimisation technique.
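  A RANSAC sketch for line fitting: analytic fit to a minimal two-point sample, scored by counting inliers (the threshold and iteration count here are arbitrary choices):

```python
import random

def ransac_line(points, n_iters=100, threshold=0.5, seed=0):
    """RANSAC: fit a line analytically to a minimal two-point sample,
    keep the hypothesis with the most inliers."""
    rng = random.Random(seed)
    best_line, best_inliers = None, -1
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair, skip this hypothesis
        a = (y2 - y1) / (x2 - x1)  # exact fit to the sample
        b = y1 - a * x1
        inliers = sum(abs(a * x + b - y) <= threshold for x, y in points)
        if inliers > best_inliers:
            best_line, best_inliers = (a, b), inliers
    return best_line, best_inliers

# Points on y = 2x + 1 plus two gross outliers.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (1, 20), (3, -5)]
(a, b), n_inliers = ransac_line(points)
```

  A least-squares fit to all seven points would be dragged off by the outliers; RANSAC recovers the clean line.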


L06: Optimisation for Ninjas using Gradient Descent
  Continuous vs. discontinuous cost functions.
  Plot ax+b for many cost functions.
  Gradient descent.
  Convergence - local vs. global minima.
  Importance of initialisation.
  Analytic differentiation vs. Central differences. Use latter to verify former.
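  Gradient descent on a toy quadratic, with the analytic gradient verified against central differences first:

```python
def cost(x):
    return (x - 3.0) ** 2 + 1.0

def grad_analytic(x):
    return 2.0 * (x - 3.0)

def grad_central(f, x, h=1e-5):
    """Central differences: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Use central differences to verify the analytic gradient.
for xc in (-1.0, 0.5, 4.2):
    assert abs(grad_analytic(xc) - grad_central(cost, xc)) < 1e-6

# Plain gradient descent from an arbitrary initialisation.
x = -5.0
for _ in range(200):
    x -= 0.1 * grad_analytic(x)  # fixed step size of 0.1
```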
  Automatic differentiation, with TensorFlow demo.


L07: Logistic Regression
  Importance of uncertainty.
  Logistic Regression.
    Histograms for continuous variables, mention triangular kernels.
    Being hierarchical by repeating variable.
  Demo of election polling with the 'Mister P' algorithm (multilevel regression and poststratification, Andrew Gelman)
  Overview of generalised linear model - not a lot of detail.


L08: Is it working?
  Train vs test set.
  Tuning set for hyper-parameters.
  n-fold.
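  A sketch of building n-fold train/test index splits (interleaved assignment for brevity; real code would shuffle first):

```python
def n_fold_indices(n_samples, n_folds):
    """Partition sample indices into n_folds (train, test) splits."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = sorted(j for k in range(n_folds) if k != i for j in folds[k])
        splits.append((train, test))
    return splits

# Every sample appears in exactly one test fold.
splits = n_fold_indices(10, 5)
```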
  Mention out of bag error again. Jackknife variant.
  Model is not reality.
    Bad model vs. bad fitting vs. bad data.
    Good test results do not mean you can trust any other question you ask of the model.
  Examples of ML failing.
    Tank example - bad data.
    WW2 bombers - confirmation bias.
    Hospital triage for breathing difficulties/asthma example.
    Need more!
  Visualisation to avoid problems.
  Visualisation can also be misleading!
    Use perception of gradients as an example.


L09: Regularisation
  Overfitting and underfitting reminder.
    Use n-nearest neighbour to demonstrate. Mention 'this works if you have infinite data'?
  Generative vs discriminative.
  Regularisation
    For better answers - avoid overfitting.
    For optimisation.
    For human understanding.
    For some combination of the above.
  Simple model to complex model.
    Scale space when you have a spatial or temporal arrangement, e.g. images.
  Objectives:
    Maximum likelihood
    Maximum a posteriori (MAP)
    Bayesian (Please all stare at the Bayesian-hypno-toad)
  Example optimised with all of the above - need to find something simple.
  Types of prior - including uninformative (discuss fuzziness...), improper, human driven, data driven, minimum description length.
  Generative vs not-generative models. (Gelman distinction) Implicit models?
  Stupid approaches: Early stopping and optimising the wrong cost function badly.


L10: Curse of Dimensionality
  Introduce the curse, give some examples.
  Feature design (+filtering?) as solution, with examples.
  Invariance and equivariance.
    Example from 'Random forests for metric learning with implicit pairwise position dependence'.
    Steerable filters.
    Maybe SIFT?
  Dimensionality reduction & manifolds.
  PCA
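  A PCA sketch restricted to 2D data, where the leading eigenvector of the 2x2 covariance matrix has a closed form:

```python
import math

def pca_2d_first_component(points):
    """First principal component of 2D data: the leading eigenvector
    of the 2x2 covariance matrix, found in closed form."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Leading eigenvalue of [[cxx, cxy], [cxy, cyy]].
    half_trace = (cxx + cyy) / 2
    det = cxx * cyy - cxy * cxy
    lam = half_trace + math.sqrt(half_trace ** 2 - det)
    # Corresponding eigenvector, normalised to unit length.
    if abs(cxy) > 1e-12:
        v = (lam - cyy, cxy)
    else:
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

# Points scattered along the line y = x, so the first component
# should come out close to (1, 1) / sqrt(2).
points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 1.1), (2.0, 1.9)]
direction = pca_2d_first_component(points)
```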
  Maybe Isomap, time permitting.
  Mention, so they are aware of:
    Learning features with convolutional NN.
    Using conditional independence, as motivation for next 4 lectures.


L11: Graphical Models
  Introduce as factorisation of a function.
  Demo breaking up some distributions.
  Representations:
    Factor graphs (from SAT)
    Bayesian network
    Markov random field
  Conversion between; don't go overboard.
  Examples.
  No requirement they represent causality.
  Markov random chain, be clear on Markov property.
    Drawing - demo using English language pairwise stats to generate fake words!
    MAP - dynamic programming, demo using English language to guess missing letters.
    Fitting from data - need to decide how far to go with regularisation.
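  The fake-word drawing demo can be sketched with pairwise letter stats like this (tiny made-up 'corpus', not real English statistics):

```python
import random
from collections import Counter, defaultdict

START, STOP = "^", "$"

def fit_pairwise(words):
    """Letter-pair counts, with start/stop markers - the Markov chain."""
    counts = defaultdict(Counter)
    for word in words:
        chain = START + word + STOP
        for a, b in zip(chain, chain[1:]):
            counts[a][b] += 1
    return counts

def draw_word(counts, rng, max_len=12):
    """Sample letter by letter from P(next | current)."""
    letter, out = START, []
    while len(out) < max_len:
        options = counts[letter]
        letter = rng.choices(list(options), weights=list(options.values()))[0]
        if letter == STOP:
            break
        out.append(letter)
    return "".join(out)

words = ["banana", "bandana", "cabana"]
counts = fit_pairwise(words)
rng = random.Random(0)
fake = draw_word(counts, rng)
```

  Only letters from the training words can ever be drawn, and only pairs that were actually observed - which is exactly the Markov property at work.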
  Be clear about exact meaning of dynamic programming.
  Dynamic programming tricks - known sequence and fixing the last item/making it loop.
  Kalman filtering/smoothing, with accelerometer data, as example of a Markov random chain with continuous random variables.


L12: Belief Propagation
  Introduce a medical scenario (try to avoid anything someone might freak out over).
  Fitting data when all values known.
  Belief propagation to find marginals and MAP.
  Bayesian decision theory for deciding what to treat (or to collect more information).
  Ask students what they would do before calculating, to demo how crap we are at this.
  Could be interesting to show a variety of extreme cases, e.g. problems where you always treat, or never treat.


L13: Do you like Terminator? (once demo is done maybe use least liked film!)
  Collaborative filtering/recommender systems.
  Plate notation
  Latent random variables.
  Latent Semantic Analysis/Indexing.
  Worked example for Netflix-like data.


L14: Variational techniques
  Concept of a variational algorithm.
  Worked example for something simple.
  Mean field approach.
  Topic models, LDA.
  Work through variational model and demo on Reuters data set.


L15: Optimisation for Wizards
  Overparameterisation, demo on ellipses (Fitzgibbon's stuff!).
  Multiple restart.
  Solving a simpler problem first - hierarchy of complexity.
  If you can blur your cost function...
  Line Search for setting step size.
  Rosenbrock's valley (Banana function)
  Subset of dimensions at a time, with above for warnings.
  Simple approach - reduce step size if not improved in a while.
  Momentum - Nesterov, inc. visual explanation from the 'I'm a bandit' blog. Heavy ball motivation.
  Newton's method, inc. multivariate.
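  A 1D Newton's method sketch (the multivariate version replaces the second derivative with the Hessian):

```python
def newton_minimise(f_prime, f_double_prime, x0, n_iters=20):
    """1D Newton's method: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(n_iters):
        x -= f_prime(x) / f_double_prime(x)
    return x

# Minimise f(x) = x^4 - 3x^2 + 2, starting near its right-hand
# minimum at x = sqrt(1.5). A bad initialisation would find the
# other minimum, or a saddle - ties back to the convergence lecture.
f_prime = lambda x: 4 * x ** 3 - 6 * x
f_double_prime = lambda x: 12 * x ** 2 - 6
x_min = newton_minimise(f_prime, f_double_prime, x0=2.0)
```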


L16: Support Vector Machines
  Introduce SVM.
  How to solve it with linear programming (Only 2D, so soft introduction).
  Kernel trick.
  Comment on SVM vs. random forest vs. neural networks, inc. data size.


L17: Gaussian Processes
  Parametric vs. non-parametric
  Gaussian processes - basically first chapter of GP book.
  Variety of covariance functions, inc. periodic and discrete.


Dropped:
L18: Hyper-parameter Optimisation
  Nature of problem, compared to normal optimisation - time to query a point and no gradient.
  Bayesian quadrature.
  Multi-armed bandit.
  Thompson sampling.
  Demo on a real problem.

